This week’s topic is exploring and analyzing statistical data: summaries, correlations, linear discriminant analysis, and clustering.
I load the Boston data set into R from the MASS package, where it ships as one of the package’s built-in data sets.
library(dplyr);library(tidyr);library(ggplot2)
library(boot);library(MASS);library(corrplot)
library(plotly)
#library(tidyverse)
#Load the Boston data from the MASS package.
data("Boston")
The Boston data set contains information about factors affecting housing values in the suburbs of Boston. The variables include, for example, the per capita crime rate (crim), the full-value property tax rate per $10,000 (tax) and the pupil–teacher ratio (ptratio).
#Explore the structure and the dimensions of the data
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
From the summary, we can see that all the variables in this data frame are numeric; chas is a binary (0/1) dummy variable indicating whether a tract bounds the Charles River.
# The correlation matrix
cor_matrix<-cor(Boston) %>% round(digits = 2)
# Plotting the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos = "b",tl.pos = "d",tl.cex = 0.6)
Above is the correlation chart of the variables. It shows, for example, that rad (index of accessibility to radial highways) correlates strongly and positively with tax (full-value property tax rate), while lstat (percentage of lower-status population) correlates negatively with medv (median value of owner-occupied homes in $1000s).
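The strongest pairwise correlations behind the plot can also be listed numerically. A minimal sketch (the data-frame and column names here are my own choices, not from the original analysis):

```r
library(MASS)

data("Boston")
cor_matrix <- round(cor(Boston), 2)

# take each pair once (upper triangle) and sort by absolute correlation
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
pairs_df <- data.frame(var1 = rownames(cor_matrix)[idx[, 1]],
                       var2 = colnames(cor_matrix)[idx[, 2]],
                       r    = cor_matrix[idx])
head(pairs_df[order(-abs(pairs_df$r)), ])
```

This confirms, e.g., the strong positive rad–tax correlation and the negative lstat–medv correlation.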
The data must be scaled before any further analysis, because distance-based methods such as LDA and k-means are sensitive to the scales of the variables.
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
The (scaled) crime rate is categorized into four classes, using its quantiles as break points.
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# save it
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
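Because the break points are quartiles, the four classes should be roughly equal in size. A self-contained sketch of the binning step:

```r
library(MASS)

data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

# quartiles of the scaled crime rate as break points
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins,
             include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))

# quartile binning gives roughly equal-sized classes
table(crime)
```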
Next, the data is randomly split into a training set (80% of the rows) and a test set (the remaining 20%).
# choose randomly 80% of the rows
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
# create train and test set
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
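Since sample() is not seeded here, the split (and hence all results below) differs between runs. A minimal sketch of the same split with set.seed added for reproducibility (the seed value is an arbitrary choice, not from the original analysis):

```r
library(MASS)

data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

set.seed(2017)  # arbitrary seed for a reproducible split
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)

train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]
c(train = nrow(train), test = nrow(test))
```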
Next, a linear discriminant analysis is fitted, with the categorical crime rate as the target variable and all the other variables as predictors.
lda.fit <- lda(crime ~., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2524752 0.2599010 0.2475248 0.2400990
##
## Group means:
## zn indus chas nox rm
## low 0.97941010 -0.98171566 -0.117932980 -0.8756611 0.5095695
## med_low -0.08031574 -0.23590590 -0.009855719 -0.5304511 -0.1707712
## med_high -0.37404455 0.08896355 0.200122961 0.3222790 0.1256098
## high -0.48724019 1.01721867 -0.109974419 1.0597364 -0.4384570
## age dis rad tax ptratio black
## low -0.8673216 0.8996590 -0.6947544 -0.7406985 -0.4386490 0.3826578
## med_low -0.2637219 0.3359894 -0.5443600 -0.4568671 -0.0670028 0.3272386
## med_high 0.3414028 -0.3316590 -0.3766291 -0.3229980 -0.2639944 0.1173009
## high 0.8376067 -0.8700253 1.6371072 1.5133254 0.7795879 -0.8789726
## lstat medv
## low -0.814431632 0.5883712
## med_low -0.067085718 -0.0159933
## med_high -0.003915838 0.2116094
## high 0.957036476 -0.7610892
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.122311233 0.76129508 -0.63606625
## indus 0.008265984 -0.42176411 0.58482782
## chas -0.095586675 -0.08728622 0.01938195
## nox 0.472665518 -0.63403754 -1.62060573
## rm -0.115951055 -0.03241037 -0.19573621
## age 0.280737808 -0.20704065 -0.05850912
## dis -0.059128393 -0.27690382 0.07237108
## rad 3.036995267 0.75458201 0.08371175
## tax -0.071074574 0.16243360 0.39284080
## ptratio 0.118185684 0.06767660 -0.36671221
## black -0.148982845 -0.02962798 0.14013688
## lstat 0.209399512 -0.24494411 0.29511596
## medv 0.193628064 -0.40787376 -0.32877666
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9477 0.0371 0.0151
Plotting the LDA results as a biplot, with arrows showing the influence of each predictor on the discriminants.
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
Calculating the LDA prediction performance on the test set.
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 14 11 0 0
## med_low 6 14 1 0
## med_high 1 6 19 0
## high 0 0 0 30
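The overall accuracy is the share of the diagonal in the cross-tabulation. A sketch using the particular table shown above (the split is random, so the exact counts vary between runs):

```r
# confusion matrix copied from the cross-tabulation above
conf <- matrix(c(14, 11,  0,  0,
                  6, 14,  1,  0,
                  1,  6, 19,  0,
                  0,  0,  0, 30),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

# correct predictions lie on the diagonal
accuracy <- sum(diag(conf)) / sum(conf)
accuracy  # 77/102, about 0.75
```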
Loading the Boston data again and preparing it for clustering
data('Boston')
Boston <- scale(Boston)
dist_eu <- dist(Boston)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4620 4.8240 4.9110 6.1860 14.4000
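dist() uses the Euclidean metric by default and returns one distance per pair of observations. A sketch that also computes an alternative metric for comparison (the object names are my own):

```r
library(MASS)

data("Boston")
boston_std <- scale(Boston)

# dist() returns the lower triangle of the pairwise distance matrix
dist_eu  <- dist(boston_std)                        # method = "euclidean" (default)
dist_man <- dist(boston_std, method = "manhattan")  # an alternative metric

length(dist_eu)  # 506 * 505 / 2 pairwise distances
```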
Clustering the data with k-means, using 4 cluster centroids.
km <-kmeans(Boston, centers = 4)
pairs(Boston, col = km$cluster)
Finding the optimal number of clusters by calculating the total within-cluster sum of squares (WCSS) for 1 to 10 clusters.
# calculate WCSS
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
From the plot above, the optimal number of cluster centroids seems to be 2: a good number of clusters is where the total WCSS drops sharply.
Running k-means again, now with 2 cluster centroids.
km <-kmeans(Boston, centers = 2)
pairs(Boston, col = km$cluster)
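With two centroids, the cluster sizes and the variance captured by the split can be inspected directly. A minimal sketch (seeded, since kmeans starts from random centers; the seed is an arbitrary choice):

```r
library(MASS)

data("Boston")
boston_std <- scale(Boston)

set.seed(123)  # kmeans picks random starting centers
km <- kmeans(boston_std, centers = 2)

km$size                  # observations per cluster
km$betweenss / km$totss  # share of total variance captured by the clustering
```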
For a 3D visualization, the training data is projected onto the linear discriminants by multiplying the predictor matrix with the LDA scaling matrix.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
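As a sanity check on this kind of projection, predict() yields the same LD scores as the matrix product once the predictors are centered first. A self-contained sketch on the built-in iris data (used here instead of the Boston split so the example runs on its own):

```r
library(MASS)

fit <- lda(Species ~ ., data = iris)
X <- as.matrix(iris[, 1:4])

# predict.lda centers predictors by their (prior-weighted) means before projecting;
# with default priors this equals the overall column means
scores_manual <- scale(X, center = colMeans(X), scale = FALSE) %*% fit$scaling
scores_pred   <- predict(fit)$x

max(abs(scores_manual - scores_pred))  # ~0
```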
A 3D plot of the projected training data, colored by crime category.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=train$crime)
train$crime
## [1] med_high med_low high low med_low med_high high
## [8] high low high low med_high med_low high
## [15] high med_high high med_low med_high high low
## [22] high low low low high med_high low
## [29] low med_low med_high low high med_high low
## [36] med_high low med_high med_high med_low med_high med_low
## [43] low med_high med_high med_high high med_low med_high
## [50] med_high med_high high med_low high low high
## [57] high med_high med_high high med_low low med_low
## [64] med_high med_high low high low low low
## [71] low high med_low high low low high
## [78] high high med_low low high med_high low
## [85] med_low med_low med_high med_low med_high med_low low
## [92] low med_high med_high med_high low low med_low
## [99] med_high med_low med_low high low med_low med_low
## [106] low high med_low high med_low med_high high
## [113] med_high med_low med_low low med_low med_high med_high
## [120] high med_low high med_low med_high high med_high
## [127] high high low med_high high high low
## [134] high low med_low low med_low med_low med_high
## [141] high low med_high high med_low low med_high
## [148] high low high med_low low med_high med_low
## [155] med_low med_low med_high low med_high med_low med_high
## [162] med_low med_low high high med_low med_high low
## [169] med_low high high med_high med_high high med_low
## [176] med_low high med_high high med_high med_high med_low
## [183] med_low high med_low med_high med_low low med_high
## [190] med_high med_high med_low med_high low med_high low
## [197] high low med_low high med_high med_high med_high
## [204] med_high high low med_high med_high med_high high
## [211] high med_high high high med_high med_low high
## [218] low med_high low med_low med_high low med_low
## [225] med_high high high med_high med_high high med_high
## [232] high med_high high low med_high med_high med_high
## [239] high med_high high med_high med_low low low
## [246] low low med_low low med_high low high
## [253] low high med_low low med_high med_high med_high
## [260] med_low high med_low low low med_low low
## [267] med_low low med_low med_low med_low med_low med_low
## [274] high med_low med_low med_low low med_low high
## [281] low med_high med_low low med_low med_low med_low
## [288] high high med_low med_low med_low low low
## [295] med_low med_high low med_low low med_high med_low
## [302] high low high med_low med_low high med_high
## [309] low high low med_high high med_low low
## [316] low high med_low high med_high low med_low
## [323] high med_high high med_low med_high med_high high
## [330] low low low med_high high low low
## [337] med_low high low low med_high med_high high
## [344] high high high med_low med_low low med_low
## [351] med_low med_high high high med_low high med_low
## [358] high high med_low med_low low med_high low
## [365] high med_low high high med_high high low
## [372] med_low med_high med_low med_low high low low
## [379] low med_high high med_low low low low
## [386] high med_high low med_high med_high med_high med_low
## [393] low med_high high low low high high
## [400] med_high med_low high high med_low
## Levels: low med_low med_high high
lda.pred$class
## [1] med_low med_low med_high low low med_low med_low
## [8] med_low med_low low med_low med_low low med_low
## [15] med_low med_low med_low med_low med_low low med_low
## [22] low med_low med_low med_low med_low med_low med_low
## [29] med_low med_high med_high med_high med_high med_high med_high
## [36] med_high med_high med_high med_high med_high med_high med_high
## [43] med_high med_high med_low med_high med_high low low
## [50] med_high low low low low low low
## [57] low low med_low low low low med_low
## [64] med_low med_low med_low med_low low med_low med_low
## [71] low high high high high high high
## [78] high high high high high high high
## [85] high high high high high high high
## [92] high high high high high high high
## [99] high high high med_high
## Levels: low med_low med_high high